fix: skip model/API-key validation for oracle agent #173

Merged
xdotli merged 2 commits into benchflow-ai:main from EYH0602:fix/oracle-skip-api-key-check
Apr 21, 2026

@EYH0602 EYH0602 commented Apr 21, 2026

Closes #172

Summary

  • Oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for the CLI's default model (claude-haiku-4-5-20251001)
  • bench eval create -a oracle now works without ANTHROPIC_API_KEY set
  • One-line change: skip the model validation block when agent == "oracle"

Test plan

  • All 42 tests in test_resolve_env_helpers.py and test_subscription_auth.py pass
  • Run bench eval create -t <task> -a oracle without ANTHROPIC_API_KEY — should succeed
  • Run bench eval create -t <task> -a claude-agent-acp — API key validation still works as before

The oracle agent runs solution/solve.sh and never calls an LLM, but
resolve_agent_env() was validating API keys for whatever model the CLI
defaulted to (claude-haiku-4-5-20251001). This made `bench eval create
-a oracle` fail without ANTHROPIC_API_KEY set, even though oracle
doesn't need it.

EYH0602 commented Apr 21, 2026

CI failed on main, not triggered by this PR.

@EYH0602 EYH0602 left a comment

The PR #173 fix is installed and works for its intended purpose (OAuth token as alternative auth), but it doesn't help the oracle case: oracle needs no model or key at all, yet benchflow always forces DEFAULT_MODEL = "claude-haiku-4-5-20251001" (in job.py:141). When you run bench eval create -t <task> -a oracle without -m, the CLI does model=model or DEFAULT_MODEL (cli/eval.py:235), which always falls back to haiku and triggers the API key validation, even though oracle just runs solve.sh and never calls any LLM.

This is a separate bug worth tracking — the oracle agent should either skip model/API-key validation entirely or not have a model assigned by default.

Move the fix from resolve_agent_env to the CLI layer: oracle runs
solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL
at all. Both _run_single and _run_batch now pass model=None for oracle.
Widen JobConfig.model to str | None to support this.

EYH0602 commented Apr 21, 2026

Moved the fix to the CLI layer as suggested. Oracle now gets model=None instead of DEFAULT_MODEL — no model assigned, no validation triggered.

Changes in 360c460:

  • cli/eval.py: both _run_single and _run_batch pass model=None when agent == "oracle"
  • job.py: widen JobConfig.model to str | None to support this
  • _agent_env.py: reverted the agent != "oracle" skip — no longer needed since model is None

xdotli commented Apr 21, 2026

/review

@devin-ai-integration devin-ai-integration Bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 2 additional findings.

@xdotli xdotli merged commit a099ff9 into benchflow-ai:main Apr 21, 2026
1 of 2 checks passed

@EYH0602 EYH0602 left a comment

The fix in cli/eval.py is correct, but bench eval create actually dispatches through cli/main.py:707 (eval_create registered on eval_app), not cli/eval.py. That code path still has the unfixed model or DEFAULT_MODEL on lines 763, 769, and 780. The same oracle skip needs to be applied there too.

xdotli commented Apr 21, 2026

The fix in cli/eval.py is correct, but bench eval create actually dispatches through cli/main.py:707 (eval_create registered on eval_app), not cli/eval.py. That code path still has the unfixed model or DEFAULT_MODEL on lines 763, 769, and 780. The same oracle skip needs to be applied there too.

should we open another pr?

EYH0602 commented Apr 21, 2026

The fix in cli/eval.py is correct, but bench eval create actually dispatches through cli/main.py:707 (eval_create registered on eval_app), not cli/eval.py. That code path still has the unfixed model or DEFAULT_MODEL on lines 763, 769, and 780. The same oracle skip needs to be applied there too.

should we open another pr?

I think so @xdotli , should I do it?

xdotli commented Apr 21, 2026 via email

xdotli pushed a commit that referenced this pull request Apr 22, 2026
PR #173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to
cli/eval.py, but cli/eval.py is orphaned (never imported into the live
CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and
trips ANTHROPIC_API_KEY validation. Three changes:

- Restore the `agent != "oracle"` guard in resolve_agent_env so the
  chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests — the live eval_create
  lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model
  default to None, replace seven `model or DEFAULT_MODEL` sites in
  cli/main.py and job.py YAML loaders so oracle gets honest model=None
  end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by
calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key
and no host auth — verified to fail on main with the user-facing
ANTHROPIC_API_KEY error and pass after the fix.
xdotli pushed a commit that referenced this pull request Apr 22, 2026
Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer
of the prior fix at the right altitude:

- TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no
  src/ file references benchflow.cli.eval, guarding against a future
  re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting — `bench eval create` callback lives in
  cli/main.py:eval_create. Pins the architectural fact PR #173 missed.
- TestEffectiveModel — unit tests for the helper: oracle drops model,
  non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None
  for both native and Harbor formats; non-oracle backwards-compat
  preserved.
- TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle")
  with no API key in env does not raise. Mocks Trial.create and resets
  the asyncio loop after to avoid polluting pre-existing tests that use
  the deprecated asyncio.get_event_loop() pattern.
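
The loop reset mentioned in the last bullet can be sketched as follows. `fake_trial` is a hypothetical stand-in for the mocked `Trial.create`, and the reset shown is one plausible way to do it, not necessarily what the test suite does.

```python
import asyncio


async def fake_trial() -> str:
    """Hypothetical stand-in for the mocked Trial.create coroutine."""
    return "ok"


def run_and_reset_loop() -> str:
    # asyncio.run() closes its loop and leaves no current loop behind.
    result = asyncio.run(fake_trial())
    # Install a fresh loop so later asyncio.get_event_loop() callers still work.
    asyncio.set_event_loop(asyncio.new_event_loop())
    return result
```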

Verified to fail on main in the right shape: 9 of 14 fail (each pinning
a deleted/added behavior), 5 pass (asserting structural facts already
true). The CLI test fails on main with the user-reported error
"ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
@EYH0602 EYH0602 deleted the fix/oracle-skip-api-key-check branch April 22, 2026 07:13
xdotli added a commit that referenced this pull request Apr 25, 2026
* fix: skip model/API-key validation for oracle agent

* fix: don't assign default model to oracle agent

* fix: oracle agent — chokepoint guard, drop orphan eval CLI, helper

* test: regression suite pinning oracle chokepoint + orphan removal

* fix: restore cli/eval.py and test_eval_cli.py, apply oracle guard

The previous commit deleted cli/eval.py and its tests as orphans, but
they are intentionally kept. Restore both from main, update eval.py to
use the effective_model() helper for the oracle chokepoint fix, and
replace the "module is gone" regression test with a guard that cli/main.py
does not import cli/eval (the actual invariant).

* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI

---------

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
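
The invariant named in the restore commit above (cli/main.py does not import cli/eval) could be checked with a small AST scan. This is a sketch under stated assumptions: `imports_module` is a hypothetical helper, it only looks at absolute imports, and the real regression test may be written differently.

```python
import ast


def imports_module(source: str, module: str) -> bool:
    """Return True if `source` imports `module` (or a submodule) via an absolute import."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # `from pkg import name` may import the submodule pkg.name.
            names = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
        else:
            continue
        if any(n == module or n.startswith(module + ".") for n in names):
            return True
    return False
```

A test would read the source of the live CLI module and assert that `imports_module(source, "benchflow.cli.eval")` is false.
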
xdotli added a commit that referenced this pull request Apr 25, 2026
* fix: skip model/API-key validation for oracle agent

* fix: don't assign default model to oracle agent

* fix: oracle agent — chokepoint guard, drop orphan eval CLI, helper

* test: regression suite pinning oracle chokepoint + orphan removal

* fix: restore cli/eval.py and test_eval_cli.py, apply oracle guard

* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI

* docs(plan): add plan to fix sandbox io problem

* test: lock sandbox setup contract

Plan step 1/6: Lock the new sandbox contract in tests

* fix: stop copying root tool installs into sandbox home

Plan step 2/6: Narrow setup_sandbox_user() to user state only

* refactor: derive sandbox home dirs from registry config

Plan step 3/6: Align registry semantics with the new contract

* refactor: symlink skills into sandbox, enforce shared install prefixes

Replace per-trial skill-tree copies with ln -sfn into a shared /skills (or
task skills_dir) root, drop skill_paths from get_sandbox_home_dirs(), and
add registry + sandbox-setup invariants that keep agent binaries on
/usr/local/* rather than /root-only home paths. Updates task-authoring
and api-reference docs to describe the new lightweight sandbox contract.

* chore: remove completed sandbox plan doc

---------

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
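
The symlink commit above relies on `ln -sfn` inside the sandbox. For readers more familiar with Python, a rough host-side equivalent is sketched below. `force_symlink` and the `.tmp-link` suffix are hypothetical, and the sketch assumes the link path is either absent or already a symlink, not a real directory.

```python
import os
from pathlib import Path


def force_symlink(target: Path, link: Path) -> None:
    """Point `link` at `target`, replacing an existing symlink (roughly `ln -sfn`)."""
    tmp = link.with_name(link.name + ".tmp-link")
    if tmp.is_symlink():
        tmp.unlink()
    os.symlink(target, tmp)
    # rename(2) swaps the link itself rather than following it into the old target.
    os.replace(tmp, link)
```

Re-pointing a link this way costs the same regardless of how large the skill tree is, which is the point of replacing per-trial copies.
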
EYH0602 added a commit to EYH0602/benchflow that referenced this pull request Apr 25, 2026
* fix: skip model/API-key validation for oracle agent

* fix: don't assign default model to oracle agent

* fix: oracle agent — chokepoint guard, drop orphan eval CLI, helper

* test: regression suite pinning oracle chokepoint + orphan removal

* fix: restore cli/eval.py and test_eval_cli.py, apply oracle guard

* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI

* docs: use `uv tool install` instead of `pip install`

benchflow is a CLI tool with entry points — uv tool install gives users
an isolated environment (like pipx) without managing venvs manually.

---------

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
EYH0602 added a commit to EYH0602/benchflow that referenced this pull request Apr 25, 2026
* fix: skip model/API-key validation for oracle agent

* fix: don't assign default model to oracle agent

* fix: oracle agent — chokepoint guard, drop orphan eval CLI, helper

* test: regression suite pinning oracle chokepoint + orphan removal

* fix: restore cli/eval.py and test_eval_cli.py, apply oracle guard

* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI

* test: cover sandbox setup timeout wiring

* docs: document sandbox setup timeout

* feat: wire sandbox setup timeout through configs

`setup_sandbox_user()` already accepted a `timeout_sec` kwarg (default
120s) but no live call site surfaced it — the knob was unreachable for
normal runs. Under heavy sandbox bootstrap (parallel containers copying
large tool caches into /home/<sandbox_user>) the 120s cap was hit with
no user override.

Add `sandbox_setup_timeout: int = 120` to TrialConfig, JobConfig, and
RuntimeConfig, and forward it through:
- trial YAML (`trial_config_from_dict`)
- job YAML (both native and Harbor-compatible loaders)
- `SDK.run(..., sandbox_setup_timeout=...)`
- `bench eval create --sandbox-setup-timeout`
- `Trial.install_agent()` into both `setup_sandbox_user()` call sites
  (oracle + normal agent)

The value is also recorded in the run's `config.json` snapshot to aid
post-hoc diagnosis. Default stays at 120s — this change is about making
the value configurable, not changing runtime behavior.

---------

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
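
The timeout wiring described in the last commit above can be pictured with a minimal sketch. `TrialConfig` and `trial_config_from_dict` are named in the commit message, but the fields and loader body shown here are trimmed assumptions, not the project's real definitions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TrialConfig:
    """Trimmed, hypothetical stand-in; the real class has many more fields."""
    agent: str
    sandbox_setup_timeout: int = 120  # seconds; the default is unchanged


def trial_config_from_dict(raw: dict[str, Any]) -> TrialConfig:
    """Sketch of a YAML-dict loader that forwards the timeout when present."""
    return TrialConfig(
        agent=raw["agent"],
        sandbox_setup_timeout=int(raw.get("sandbox_setup_timeout", 120)),
    )
```

A config that omits the key keeps the 120 s default, so existing runs behave as before.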

Development

Successfully merging this pull request may close these issues.

Oracle agent incorrectly requires ANTHROPIC_API_KEY
